Abstract. The past year has seen a rise in the profile of 'Big Data' in both
the public and private sectors. This paper examines the generation and use 
of spatial Big Data through Location-Based Services and Social Networks.
Drawing on interviews of LBS/LBSN designers and developers, the paper 
introduces the concept of Data Fumes - where one application leverages 
the data produced by another. The reliance upon Data Fumes is shown to 
have profound effects on what can be known and done with spatial Big Da- 
ta. Finally, the paper suggests spatial Big Data should be examined with an 
eye to the earlier debate of "systems vs. science" within the GIS community.
1. Introduction

In late January of 2012, Daniel Rasmus, a writer for Fast Company, predicted
2012 to be the "year of Big Data" (Rasmus 2012). From marketing
(Baker 2012) to health care (Cerrato 2012) to the national science funding
agencies (NSF Press Release 2012), Big Data and its related models and 
analytical approaches have risen drastically in prominence. This brief paper 
examines the generation of spatial Big Data through the use of Location- 
Based Services (LBS) and Social Networks (LBSN). It proceeds in three 
parts: First, drawing from ethnographic interviews of mobile application 
designers and developers, the term "Data Fumes" is introduced and defined
as the predominant orientation of private companies creating LBS/LBSNs. 
Second, two effects this orientation has upon the data produced are demon- 
strated. On the one hand, both what information Data Fumes contain and 
who has access to them is controlled by an extremely small set of designers 
working for private corporations whose decisions propagate through the
mobile application ecosystem. On the other, the "location" produced in spatial
Big Data sets is separated from physical location and commoditized.
The data generated through LBS and LBSN use is driven by a motive for
profit on the part of both the end-user and the corporation; this motivation
shapes the very definition and understanding of "location." Finally, the paper
concludes by drawing parallels between concerns in spatial Big Data research
and the earlier "Systems versus Science" debates in the GIS community. 
Current research into LBS, LBSNs and Big Data more broadly will be well 
served by paying heed to these earlier discussions and research. 1



2. Spatial Big Data and Data Fumes 



Before turning to the concept of Data Fumes, it's necessary to briefly define 
how Big Data is understood with respect to spatial information in this pa- 
per. From a technical perspective, data has always been big (Farmer and 
Pozdnoukhov 2012). As such, "Big Data" presents an ever-shifting target that
has less to do with size and more with the ability to rapidly combine,
aggregate, and analyze diverse and disparate sets of information. Following
Jacobs (2009, 39), the "pathologies of big data are primarily those of
analysis." Beyond the technical, the rapid growth of Big Data approaches and the
acceptance of the Big Data movement in both private firms and academic
research reveals a socio-technical phenomenon that accepts certain
epistemological and cultural views of scientific praxis and the nature of reality. In
this sense, Big Data represents the latest iteration of the desire to find effi- 
ciency and meaning in quantitative analysis. Big Data requires a belief that 
life can be captured and modeled by data or even fully transformed into it 
(boyd and Crawford 2012; Berry 2011). With reality accurately captured in 
ever larger data-sets, scientific praxis transforms into a new "fourth paradigm"
of manipulation and exploration (Hey et al. 2009). Large numbers of
potential correlations are equally considered, in contrast to traditional 
methods driven from a theoretically based hypothesis and a small number 
of testable variables (Batty 2012). Science in the Big Data era is an abduc- 
tive process where "hypotheses are developed to account for observed data" 
(Farmer and Pozdnoukhov 2012, 5).

The "Big Data perspective" for spatial information involves "using large-scale
mobile data as input to characterize and understand real-life phenomena"
(Laurila et al. 2012, 1). At present, this mobile data comes from two
predominant sources - Twitter and Foursquare - and the data captured
and stored from mobile users of these applications is seen as containing
"rich information" (Long et al. 2012). The Big Data viewpoint posits that
check-ins and tweets reveal meaningful information that can be used by
researchers to study society and by companies to increase profits. These two
distinct goals are both pursued through the data already generated by end-users
of certain applications, something referred to here as "Data Fumes."

1 A significantly expanded version of this paper addressing academic study of spatial Big
Data is presently under review at the International Journal of Communication.

Since 2009, more than $115 million has been invested in location-based
start-up companies, making them a major factor in the generation, manipulation
and access of spatial information data sets (Wilson 2012). Many of
these start-ups have focused on what can be called the "check-in." For
the purposes of this paper, a "check-in" is defined as an action the end-user
takes which broadcasts their supposed location at a given time and place. It
is a purposeful, end-user initiated action which creates this data-point: "Users
check-in at venues where they are present, effectively reporting their
location" via the application to those who have access to the information - a
group which may include friends, the makers of the application, other
corporations, and researchers (Carbunar and Potharaju 2012). Check-ins may
be distinguished from other more ubiquitous types of spatial tracking in
that they are discrete events: a user checks in at a movie theater and later
checks in at a restaurant, but the period between check-ins is not captured.
Although other services exist, the dominant systems for checking in at the
moment are Foursquare and Facebook Places. Although different sources
cite different numbers, it is estimated that Foursquare has seen over
two and a half billion check-ins since 2009 (Benner and Robles 2012). Meanwhile,
Facebook, with over two hundred million active users, has had upwards of
two billion actions tagged with locations in April of 2012 alone (Long et al.
2012).
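
The discrete character of the check-in described above can be illustrated with a minimal sketch. The record layout and field names below are hypothetical and are not drawn from Foursquare's or Facebook's actual data formats; the sketch only shows that a sequence of check-ins records points in time, leaving the intervals between them unobserved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CheckIn:
    """Hypothetical check-in record; the field names are illustrative only."""
    user_id: str
    venue: str
    timestamp: datetime


def unobserved_gaps(check_ins):
    """Return the time between consecutive check-ins.

    Nothing about the user's location is recorded during these intervals.
    """
    ordered = sorted(check_ins, key=lambda c: c.timestamp)
    return [later.timestamp - earlier.timestamp
            for earlier, later in zip(ordered, ordered[1:])]


# The movie theater and restaurant example from the text, with invented times.
evening = [
    CheckIn("u1", "movie theater", datetime(2012, 4, 1, 18, 0)),
    CheckIn("u1", "restaurant", datetime(2012, 4, 1, 21, 30)),
]
gaps = unobserved_gaps(evening)
```

Under these assumptions, the two records leave a single interval of three and a half hours about which the data set says nothing.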

Data fumes are attempts to "add value" to the check-in, to make it more 
meaningful and to profit from the provision of this meaning. The term itself
comes from an interview with the CEO of a popular mobile spatial start-up: 

John (names have been obfuscated): In the end, [his
applications] all tie to location data, they're all sitting on top of
other people's location data and behaviors that we already
have when we're going out, like checking-in...

John: Say Foursquare didn't exist, I'd have to convince
everyone in the world to check-in and share their location. The
fact that Foursquare exists and aims to get everyone in the
world to check-in ... it means they can concentrate on that,
and I can use all the data they're already collecting, the
fumes, the exhaust of their data to do really interesting
things.

Data fumes refer directly to the data and information that is generated
through actions the end-user already takes. An application which "live[s] on
data fumes," as another interviewee, Paul, stated, is one which seeks to
access and manipulate this already existing data in such a way as to "add value"
to the end-user's experience or, as John put it, to maximize the "cost versus
reward ratio." Such an application thereby prefigures a relationship with existing
spatial data and representations. Of the mobile spatial applications
represented in the interview set, only one was not building its product off of an
existing platform. John's most recent application, for example, uses Foursquare
check-in data displayed on a map created by another third-party
vendor. In total, every application relied on mapping information provided
by either Google or OpenStreetMap. See Figure 1 for an example of Local
Mind's use of Foursquare's check-in locations, with both applications using
Google as their base layer on Android platform phones. Every interviewee
accepted that the data being generated, the millions of discrete
check-ins, represented meaningful information on which to act. The existing
behavior offered a rich source on which to "add value," to capitalize on
the Big Data sets already being generated. The next section of this article
presents the two entwined problems with a reliance upon Data Fumes.

Figure 1. Local Mind on the left and Foursquare on the right; both use Google's
basemap.



3. Access and Control 

Data Fumes engage Big Data datasets with an inscription into and belief in
the meaning of the data "given off" by actions end-users are already taking.
This data is accepted as useful, meaningful, and 'Big.' These beliefs produce
two entwined problems with the very nature of mobile Big Data. First, while
the data sets may be big in size, their format, composition, and access are
regulated by a very small number of individuals. Using information generated
by other applications forces a reliance upon the means through which
that data is made available. What can be known and what can be done with
mobile spatial Big Data is delimited by the decisions of a very small set of
programmers and businesses, their choices propagating through the mobile
ecosystem. Second, as these decisions are made by private corporations
engaged in profit-seeking behavior, "location" itself has become a commodity
able to be bid for, bought, and sold, but with little relation to actual physical
location. This section demonstrates the entwined nature of these problems,
tracing the limits placed on Big Data creation and access back to a
small number of for-profit corporations and their developers.

Much of the allure of Data Fumes is that they build from data that already
exists, that is already 'given off.' This lets start-ups focus on "add[ing] value
based on work you're already doing" (John). However, both the form of the
information accessed and the ability to access it are held entirely outside
the control of the vast majority of individuals. Despite the size of the data
set, its quality and characteristics are entirely outside the control of the
mobile start-up. When a start-up builds its application using Foursquare
check-in data, it is accepting that its users will access information
through the venue-based dataset that structures the Foursquare check-in.

For private companies, the lack of control over, and guaranteed access to,
information presents a set of concerns. For example, in May of 2012, Foursquare
announced a change in their public API in order to prevent applications
from accessing the information of users checked into venues other
than that of the end-user (Thompson 2012). While this change was meant
to prevent the creation of applications that could be used for stalking other
users, like Girls Around Me (Brownlee 2012), the change in access to information
affected a host of unrelated applications. One interviewee had to
abandon their project and begin a new one as there was no way to continue
their current application. Assisted Serendipity, an application previously
praised by the Foursquare CEO, was left unable to function and ceased development
(Thompson 2012). Private companies run clear risks of shifting
access to the information they require when relying upon the data given off
by interests other than their own. Foursquare changed their API policy due
to bad publicity; Twitter drastically restricted free access to their data when
they found a market willing to pay for it. In each case, the decision to
change was made by a single corporation, but the repercussions were felt
along the entire data chain. What can be known, what can be done, and
what can be shown are inherently controlled by those who control access to
the data.

In addition to basic concerns over access, reliance upon data given off by 
other corporations accepts an underlying view of individual and society 
shaped through the incentives that drive data creation and the imperatives 
of the corporations who drive it. The use of Twitter and Foursquare is built 
around a for-profit model in which end-users are incentivized to participate
and the information they provide is commoditized and sold to advertisers. 



This has a double effect on the commoditization of data produced. On one
end, end-users may manipulate the system by providing false information in
order to receive rewards. On the other, the organization and form of the
data produced is structured in such a way as to standardize and quantify
"location" in order to allow for its commodification and exchange.

It has been well established that the coordinate data provided by services
like Twitter and Foursquare are not nearly as accurate as their decimal
points suggest (Xu et al. 2012). For example, a Twitter user may set their
location to Boston, Massachusetts. While the user has specified their address
only to the city level, geocoding this result produces a location specified
to within a single meter. While potentially misleading, this issue is well
known and researchers have proposed solutions (Hecht et al. 2011). Check-in
information, the basis of most LBSNs like Foursquare, avoids this potential
distortion by having end-users check into geocoded venues. In order to
encourage use, Foursquare and other services offer rewards for checking
into certain venues. The first time a user checks into a restaurant they may
be offered a discount on their meal or a free drink from the bar. Similarly, a
customer may receive a permanent discount after checking into a location a
certain number of times. By rewarding user participation, Foursquare and
other services incentivize location fraud. Location fraud is to "falsely claim to
be at a location, to receive undeserved rewards or social status" (Carbunar
and Potharaju 2012, 1). Incentivizing end-users to contribute data, so that
that data may itself be sold as a commodity, results in distorting end-user
behavior to maximize the receipt of rewards.
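
The point about geocoded precision made at the start of this paragraph can be made concrete with some simple arithmetic. The sketch below is illustrative only: the coordinate string is a hypothetical geocoder output for a city-level profile, the 10 km figure is an assumed stand-in for how far from that point the user might actually be, and the meters-per-degree constant is a common approximation for latitude.

```python
# Approximate length of one degree of latitude in meters (assumption: a
# spherical Earth; the true value varies slightly with latitude).
METERS_PER_DEGREE_LAT = 111_320


def decimal_places(coordinate_text):
    """Count the digits after the decimal point in a coordinate string."""
    _, _, fraction = coordinate_text.partition(".")
    return len(fraction)


def implied_precision_m(places):
    """Ground distance represented by the last reported decimal digit."""
    return METERS_PER_DEGREE_LAT / 10 ** places


# Hypothetical geocoder output for the profile text "Boston, Massachusetts".
reported_latitude = "42.36008"
places = decimal_places(reported_latitude)
implied_m = implied_precision_m(places)

# Illustrative assumption: the user could be anywhere within roughly 10 km.
actual_uncertainty_m = 10_000
overstatement = actual_uncertainty_m / implied_m

print(f"{places} decimal places imply about {implied_m:.1f} m of precision")
print(f"precision overstated by a factor of roughly {overstatement:,.0f}")
```

Five decimal places of latitude correspond to roughly one meter on the ground, so under these assumptions the stored coordinate overstates what is actually known about the user's position by several orders of magnitude.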

For Foursquare and other LBSNs, this "fraud" doesn't matter: the data provided
is still valuable and they are still capable of turning a profit, so whether an
individual user is actually at a venue matters less than that the individual
user claims to be at said venue. In fact, Foursquare allows and encourages
applications like Check In Take Out which allow users to check into distant
restaurants in order to place take-out orders. Figure 2 shows Check In Take
Out's use of Foursquare's basic map presentation. So long as the data
produced can be exchanged for a profit, the LBSN has no incentive to prevent
this separation of physical location from check-in "location." From the
perspective of the end-user, whether they are actually at a check-in location
matters less than the rewards they receive for setting their "location" to the
venue. Where an end-user is physically located is less important than the
quantified data representing "location." While this has led some technology
writers to repeatedly proclaim the death of the check-in (Mitchell 2012), it
can more generally be seen as part of a shift in focus of mobile applications
from simply recording location and providing destinations to shaping
consumption patterns of users (Thatcher 2013; Wilson 2012).



This framework of data generation is driven by an innate profit motive -
end-users receive discounts, Foursquare sells data to partners, partners use
data to drive consumption - that is completely divorced from an accurate
representation of physical location. Researchers who make use of this data
are inherently accepting this framing for their research. Abductive exploration
of Big Data may reveal patterns, but it reveals patterns of "location" as
a commodity. Movement patterns found within Foursquare information
reflect movement first encouraged and then shaped by motives for profit. A
behavioral loop is created for both the end-user and those who study them:
"A person feeds in data, which is collected by an algorithm that then presents
the user with choices, thus steering behavior" (Lohr 2012).

Along with the opportunities offered by massive spatial data sets comes a
set of restrictions both on how data can be accessed and what can be known
through it. Spatial Big Data currently involves data created and shaped by a
motive for profit. While this does not necessarily matter for a start-up company,
it has profound repercussions for academic researchers. The next
section offers a potential means of addressing the issues at stake in the research
of spatial Big Data. It suggests looking to a series of parallel debates
that occurred within the GIS research community throughout the 1990s as
a means of deepening theoretical critiques and understanding of Big Data.



4. Conclusion: Big Data and GIS 

With upwards of 80% of all data stored by businesses and governments
having a spatial component, it has been suggested that GIS scholars hold a
"home field advantage" when it comes to the study of Big Data (Farmer and
Pozdnoukhov 2012). In addition to expertise in the handling and analysis of
large, spatially-referenced data sets, the debates which have occurred, and
continue to occur, within the GIS research community speak directly to the
concerns raised in the previous sections.

First, the data sets researchers rely upon are increasingly generated through
and controlled by privately held corporations. This raises distinct concerns
about how what can be known has become regulated by a small set of corporate
entities. This parallels discussions of a rising technocracy controlling
knowledge production in GIS. "This technocracy is hidden in the offices of
the vendors that develop the hardware and software and make the technology
more generally accessible" (Obermeyer 1995, 78). As GIS technology
advanced, the mediation by technology receded from active consideration.
Like the telephone, GIS became a "transparent" technology used without
conscious consideration of the underlying technical processes (ibid., 81). A
similar situation has arisen for Big Data researchers: APIs standardize the
process of access, structuring the data, but they also set the limits of what
the data contains. Burgess and Bruns (2012) have demonstrated that the very
heuristics of their analysis are shaped by the format of the data available
through the API used. Foursquare's recent API change demonstrates that
the very content of the data, and therefore what can be known through it,
are subject to regulation and alteration outside of the control of researchers.
The opaque process of corporate decision-making serves as the hidden
technocracy of Big Data.

Finally, the reliance upon data generated with an explicit motive for profit -
both for the end-user and the corporation - results in epistemological
commitments that echo concerns raised with regard to the knowledges
and approaches privileged by GIS use. For GIS, and now for Big Data,
there is a need to distinguish between "empirical and technical claims about
objects, practices, and institutions" and the discourses within which these
claims, and claims to truth, are made (Pickles 1995, 23). Big Data, like GIS,
accepts that a certain quantitative representation of life can stand in for its
full meaning (Curry 1997). Further, in this inscription of meaning, Big Data
must be seen as directly producing new knowledge rather than simply
revealing it. The "hard work of theory" (Pickles 1997, 370) will tie Big Data
directly to much longer traditions of social theory and social thought as they
have engaged technology. Cartographers and GIS researchers have drawn
productively from Benjamin (Kingsbury and Jones 2009), Heidegger (Pickles
1995), Foucault (Harley 1989) and other classical social theory perspectives
in deepening an understanding of the relation between the specific
technological form and the knowledges produced.

This paper has outlined some of the concerns with an increased reliance
upon Data Fumes. Drawing from earlier 'systems vs. science' debates, and
leveraging their 'home field advantage,' cartographers and GIS researchers
are in a unique position to contribute to the emerging field of Big Data
research. What can be put 'on the map' with spatial Big Data is now a product
of, first, what is captured and, then, what data is made available by its
owners - a group often distinct from those who actually produced the data. Rather
than letting these numbers 'speak for themselves,' cartographers and GIS
researchers should draw from their own long history to ground spatial Big
Data and its practices in a more nuanced understanding of visualization,
representation, and power.